The purpose of this tutorial is to introduce researchers at the Laboratory of Genomic Epidemiology of Malaria at Harvard university in the analysis of genotyping information generated through amplicon sequencing (also called Microhaplotype genotyping). This tutorial will cover:
Importing and handling tables in CIGAR format in R environment.
Adding metadata.
Performance of the genotyping process.
Molecular surveillance of drug resistance.
Monitoring transmission intensity.
Measuring geographic connectivity.
In this tutorial we will use a data set of 1300 samples distributed across 8 sequencing runs as a example. These samples comes from a collaboration with the group of Caucaseco, which has collected samples of P. falciparum from 2020 to 2022 in 5 municipalities located in Colombia. The data set has geographic information, however this information is encoded to protect the origin of the participants.
Our first step will be to call all required packages and functions in the R environment:
source('functions_and_libraries/amplseq_required_libraries.R')
source('functions_and_libraries/amplseq_functions.R')
sourceCpp('functions_and_libraries/hmmloglikelihood.cpp')
This tutorial start with .tsv table in CIGAR format that has been generated by the malaria-amplicon-pipeline of our lab. The output of this pipeline has the following structure:
\(\begin{array}{c|c:c:c:c:c:c} \text{sampleID}&Gene_1,Allele_1&Gene_1,Allele_2&...&Gene_1,Allele_k&... & Gene_m, Allele_{k_m}\\ \hline ID_1 & \text{Read counts} &&& \\ \hdashline ... &&&&\\ \hdashline ID_n &&&& \end{array}\)
Where the \(Allele\) is coded in CIGAR format, typing “.” for the reference allele, \([0-9]*[A,T,C,G]\) for each point mutation observed, and \([0-9]*[D,I]=[A,T,C,G]*\) for indels (Insertion and Deletions). The numbers denotes the position in the amplicon where the polymorphism is located
These CIGAR tables are stored in the server of the lab using the following path structure:
"STUDYNAME/RUNNAME/dada2/run_dada2/CIGARVariants.out.tsv"
As this structure is maintained across all studies in the lab, we
have created a function called read_cigar_tables that only
requires two arguments: paths (name of the folder of the
study or the STUDYNAME), and sample_id_pattern
(a pattern string in the IDs of the samples that differentiate them from
controls). For this particular example, the path will be
"data/sequencing_data/" , while the pattern will be
"ID". In case all ".tvs" files are stored in
one folder, you can use the argument files = list_of_files
instead of the argument paths, where
list_of_files is a vector with the names of all files we
want to read.
cigar_object = read_cigar_tables(paths = "data/sequencing_data/", sample_id_pattern = "ID")
This function generates a S4 cigar object containing two data.frames,
1) a cigar_table, where genotyping information is stored in
cigar format; and 2) a metadata table with 3 variables
(columns), Sample_id which contains samples Id’s,
run which contains the run number (useful for comparing
performance across batches of samples), and typeofSamp
which differentiates between controls and samples of interest. In order
to view these tables use View(cigar_table$cigar_table) or
View(cigar_table$metadata) in the console. If there are
duplicated samples, only the replicated with the highest read depth
across all loci will be keept.
Metadata would come from different sources, sometimes the sample ID
incorporates metadata in their coding system, other times metadata can
be stored in a different table, having one column specifying the sample
ID. In the case of this data set, we will perform the analysis at the
level of Municipality and in intervals of three months starting from the
first day of collection of samples. Then, to add sampling location,
month of collection and other metadata we are going to merge (using the
function left_join()) our metadata table in
the cigar_object to a metadata table from an external
source that contain all the metadata associated with these samples.
Thus, we will upload a metadata table from an external source. For this
purpose we are going to import the file called metadata.csv
into our R environment.
# Upload metadata from an external source
metadata = read.csv('data/metadata.csv')
# Merge the external metadata with our cigar_object
cigar_object@metadata = left_join(cigar_object@metadata,
metadata,
by = 'Sample_id')
Moreover, there are some artifacts that dada2 pipeline
doesn’t resolve well like the generation of chimeras (artificial fusion
of two different haplotypes during the PCR amplification) in polyclonal
samples. That is the case of the alleles PF3D7_1302900,1G
and PF3D7_0612900,215A that are observed in polyclonal
infections only. So we are going to proceed to remove both alleles from
our cigar_table. Additionally, PvDHFR locus
was amplified in order to detect mixed infections with Plasmodium
vivax, so we are going to remove this locus too.
# removing allele PF3D7_1302900,1G----
# this polymorphism is always present as a minor allele in polygenomic samples, increasing the proportion of polygenomic infections to up to 27%
cigar_object@cigar_table = cigar_object@cigar_table[,!grepl("PF3D7_1302900.1G", colnames(cigar_object@cigar_table))]
cigar_object@cigar_table = cigar_object@cigar_table[,!grepl("PF3D7_0612900.215A", colnames(cigar_object@cigar_table))]
# removing locus PvDHFR----
# this loci belongs to P. vivax
cigar_object@cigar_table = cigar_object@cigar_table[,!grepl("PvDHFR", colnames(cigar_object@cigar_table))]
We can use ggplot2 to plot the sample size (number of genotyped samples) by Municipality over time.
# Define the temporal unit of analysis
dates = sort(unique(cigar_object@metadata$Quarter_of_Collection))
# Define the geographic unite of analysis
pop_levels = levels(as.factor(cigar_object@metadata$Subnational_level2))
pop_colors = c("firebrick3", "dodgerblue3", "gold3", "darkseagreen3", "lightsalmon2")
plot_temporal_collection_of_samples = cigar_object@metadata %>%
filter(!is.na(Subnational_level2)) %>% # Remove samples with no geographic information (Controls)
summarise(nsamples= n(), .by = c(Subnational_level2, Quarter_of_Collection))%>%
ggplot(aes(x = Quarter_of_Collection, y = nsamples, fill = factor(Subnational_level2,
levels = pop_levels)))+
geom_col()+
theme_bw()+
scale_fill_manual(values = pop_colors)+
facet_wrap(.~factor(Subnational_level2,
levels = pop_levels),
strip.position = "top", ncol = 5)+
labs(title = 'Number of Samples collected over time',
y = "Collected samples by PCD",
x = "Quarter")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
legend.position = "none")+
scale_x_discrete(limits = dates)
plot_temporal_collection_of_samples
A first step in analyzing the genotypic data is to measure the performance of the genotyping process in terms of the proportion of samples amplified by each locus (amplification rate by locus) and the proportion of loci amplified per each sample (amplification rate per sample). In order to perform that task, we are going to convert our cigar table to a more manageable format called ampseq format, where amplicons or genes are going to be in columns, samples in rows, and the allele in cigar format is going to be in each cell of the matrix together with the read count of that allele in the sample.
\(\begin{array}{c|c:c:c:c} \text{sampleID}&Gene_1&Gene_2&...& Gene_m\\ \hline ID_1 & Allele_{i_{i \in Alleles_{Gene_1}}}\text{:Read count} &&& \\ \hdashline ... &&&&\\ \hdashline ID_n &&&& \end{array}\)
For that purpose we are going to use the function
cigar2ampseq. This function requires 4 arguments. First a
cigar_object (containing the cigar_table and the metadata).
Then markers, which is a table with the set of markers and
their names (in a column named amplicon). This argument is
optional, and in case is NULL, the name of the amplicons
will be extracted from the columns in the cigar_object, but
information regarding chormosome location and length of the amplicon
will be missing. The next arguments are min_abd and
min_ratio, which are the minimum abundance (minimum read
count) required to call an allele and the minimum ratio between the
minor and major alleles in a polyclonal sample. By default these two
arguments are 10 and 0.1 respectively. The last argument is
remove_controls, by default is FALSE, but if
change ti TRUE the information of the controls will be
stored in a separated slot. Thus, this function automatically
differentiate between controls and samples of interest, and its output
is a ampseq_object S4 list with 6 slots: gt
where the genotype information of the samples of interest is stored in a
ampseq format; the metadata of the samples of
interest; controls with all the information of the
controls; markers containing the information of the markers
such as the chromosome location; and other empty slots required for the
next steps such as loci_performance and
pop_summary.
# (alleles and abundance will be in cells)
markers = read.csv("reference/Pfal_3D7/markers.csv")
ampseq = cigar2ampseq(cigar_object, markers = markers, min_abd = 10, min_ratio = .1, remove_controls = T)
Once the data has been converted to the ampseq format, we can explore
the read depth coverage by samples and by marker, and make comparisons
between treatments, geographic regions or sequencing runs. For that we
can use the function plot_coverage() which requires two
arguments, the ampseq object and the variable
from metadata that we want to use to split the information.
coverage_by_municipality = plot_coverage(ampseq, variable = "Subnational_level2")
The output of this function is a list with three plots
coverage_by_municipality$plot_read_depth_heatmap
coverage_by_municipality$plot_read_depth_violin
coverage_by_municipality$plot_read_depth_violin_by_sample
We can also use the function loci_amplification_rate to
measure the proportion of samples that have been amplified by each
amplicon marker. This function requires two arguments, the
ampseq_object and threshold, which defines the
minimum proportion of samples that a marker should amplify in order to
be keep it for further analysis (default 0.65). All markers below this
threshold will be removed and stored in a different slot in
our object.
ampseq_filtered = locus_amplification_rate(ampseq, threshold = .65)
Thus in this data set 128 loci had an amplification rate above 0.65,
and 15 loci were discarded. All discarded loci are stored in the slot
discarded_loci within the ampseq_object. The
discarded loci were: .
# Plot Amplification rate per loci ----
ggdraw()+
draw_plot(ampseq_filtered@plots$amplification_rate_per_locus,
x = 0,
y = 0,
width = 1,
height = .5)+
draw_plot(ampseq_filtered@plots$all_loci_amplification_rate,
x = 0,
y = .5,
width = 1,
height = .5)
Figure 2: Proportion of amplified samples per locus. Top figure shows the distribution of the amplification rate (or proportion of amplified samples) of loci, with the number of loci in the y-axis and the amplification rate in the x-axis. Bottom figure shows the the chromosome location of each locus (chromosomes in y-axis and position in the x-axis) and the gradient color represents its amplification rate.
The next step is to measure the proportion of loci amplified per each
sample, also called the amplification rate of the samples. For that
purpose we are going to use the function
sample_amplification_rate that also requires only two
arguments, the ampseq_object and a threshold,
which is the minmum proportion of loci that a sample should amplified
(by default 0.8).
ampseq_filtered = sample_amplification_rate(ampseq_filtered)
ampseq_filtered@plots$samples_amplification_rate+
theme(axis.text = element_text(size = 12),
axis.title = element_text(size = 12))
Figure 3: Proportion of amplified loci per sample. The figure shows the distribution of the amplification rate (or proportion of amplified loci) of the samples, with the number of samples in the y-axis and the amplification rate in the x-axis.
Surveillance of polymorphisms or allelic variants associated with
resistance against antimalarials is of the utmost importance in public
health. That is why our panel of markers includes 5 genes (9 markers in
total, some genes are genotyped by more than one marker) that have
polymorphisms associated with antimalarial resistance. In this section
we will identify the different haplotypes of these genes present in our
data set, and we will determine the frequency of each haplotype in each
study area and in each quarter of the year. For this we will use the
function drug_resistant_haplotypes which has 7 arguments.
The first argument is the ampseq_object. The second is a
table with the reference alleles for each gene, and the polymorphisms
associated with resistance. The structure of this table is as
follows:
\[\begin{array}{c|c:c:c} \text{Chromosome} & \text{Gene_ID} & \text{Description} & \text{Mutation} & \text{Anotation} \\ \text{Pf3D7_04_v3} & \text{PF3D7_0417200} & dhfr & \text{N51I} & \text{Pyrimethamine resistance} \\ ... \end{array}\]
Where the mutation describes the reference amino acid (sensitive to the drug), the position in the amino acid chain, and finally the amino acid associated with resistance.
The third and fourth arguments refer to the common name of the gene
(or the name by which it is identified in the markers table
in the ampseq_object) and the gene ID in the gff file of
the referential strain (in this case strain 3D7). The fifth and sixth
arguments are the .gff and .fasta files of the genome of the reference
strain, and the last argument, variables, is a vector that
has the columns of the metadata table that will be used to
define the subpopulations and generate the report graph. For this
particular example we will define as
variables = c('Sample_id', 'Subnational_level2', 'Quarter_of_Collection'),
where Sample_id, is the column that we will use to map the
data, Subnational_level2 is the column to define the area
of study, and Quarter_of_Collection defines the time
scale.
The function generates a list with 3 tables and a graph. The first
table contains the haplotypes found in each sample (rows) for each gene
(columns). Each number in the haplotype indicates a position in the
amino acid chain, the letter before the number indicates the reference
allele (with sensitive phenotype), and the letter after the number is
the allele observed in the sample. If the letters are in capital
letters, that allele in that position has been previously reported as
resistant, whereas the letters in lower case indicate that it is the
first time that position has been reported as polymorphic. In case the
sample is polyclonal, each allele observed is separated by the following
symbol |, and the order in which they are reported is based
on their read count. Null data, defined as sections of the gene that
have not been amplified, are completed with the question mark
?.
The second table contains the inferred phenotype for each sample
(rows) for each gene (columns). The third table contains the count of
each haplotype in the population (number of times observed) and its
frequency in the population (defined based on geographic and temporal
information). It is important to note that polyclonal samples are
included in this calculation, so the denominator is defined as the total
number of haplotypes found in the populations. Remember that if we run
this analysis after filtering not amplified samples and loci, some
polymorphism will not be included. To report all mutations included in
our panel of markers associated with drug resistance, we have to recover
that information from the slot discarded_loci.
drug_resistant_haplotypes_plot = drug_resistant_haplotypes(ampseq_filtered,
reference_alleles = 'reference/Pfal_3D7/drugR_alleles.csv',
gene_names = c('PfDHFR',
'PfMDR1',
'PfDHPS',
'PfKelch13',
'PF3D7_1447900'),
gene_ids = c('PF3D7_0417200',
'PF3D7_0523000',
'PF3D7_0810800',
'PF3D7_1343700',
'PF3D7_1447900'),
gff_file = "reference/Pfal_3D7/PlasmoDB-59_Pfalciparum3D7.gff",
fasta_file = "reference/Pfal_3D7/PlasmoDB-59_Pfalciparum3D7_Genome.fasta",
variables = c('Sample_id', 'Subnational_level2', 'Quarter_of_Collection'),
na.var.rm = TRUE,
filters = c('Subnational_level2;Municipality 1,Municipality 2,Municipality 3,Municipality 4,Municipality 5', 'Quarter_of_Collection;2020-Q4,2021-Q1,2021-Q2,2021-Q3,2021-Q4,2022-Q1,2022-Q2,2022-Q3'))
The colors assigned to the haplotypes in the bar blot are randomly chosen, see below:
drug_resistant_haplotypes_plot$haplo_freq_plot
However, we can manually assign colors based on the resistance profile of the haplotye:
blues = brewer.pal(9, 'Blues')
reds = brewer.pal(9, 'Reds')
drug_resistant_haplotypes_plot$haplo_freq_plot+
theme(axis.text.x = element_text(angle = 90))+
scale_fill_manual(values = c(
'gray75',blues[6], reds[3], # MDR2
reds[c(8,7)], blues[3], reds[6], blues[6], reds[c(9,3)], # DFHR
'gray75',blues[4], reds[5], blues[c(2,6)], reds[c(4,9,6,8)], # DHPS
blues[6], #K13
'gray75', reds[c(5,4)], blues[2], reds[6] # MDR1
))+
labs(y = 'Haplotype prevalence in infected individuals')
Figure 4: Frequency of haplotypes of genes associated with drug resistance. y-axis shows the frequency in the sub-population, x-axis shows the quarter of the year where the sample was collected, rows panels corresponds to the five study areas, and column panels represents each genotyped gene. Colors were assigned based on the possible phenotype, dark blue indicates sensitive, while dark red indicates resistance. Light colors where assigned for partial haplotype information, and gray for complete missging data
The transmission of Malaria has an effect on the evolutionary processes that affect the parasite population. Thus, various genetic indicators are used to identify changes in the intensity of transmission at a spatial and temporal level. One of these indicators is the prevalence of polyclonal infections, or the coexistence of two or more different clones (haplotypes) within the infected individual. As the human parasite genome is haploid, the presence of a single haplotype is expected. The presence of two or more different haplotypes in the sample may be due to the transmission of two or more different genotypes from the same bite (co-infection), or the acquisition of parasites by multiple infections or independent bites (super-infection); and both processes are related to the intensity of transmission. In low transmission there are few infective bites, so there is little chance that a person can acquire a super-infection or co-infection, and therefore most infections are monoclonal. On the other hand, when the necessary conditions exist for the increase in the mosquito population and the interaction between it and the human is favored, super-infections and co-infections increase, generating an increase in the prevalence of polyclonal infections.
The genetic relatedness between parasites is also affected by the intensity of transmission. As we mentioned previously, in low transmission scenarios monoclonal infections are favored, so when an infected person is bitten, a single haplotype is transmitted and self-fertilization of the parasite occurs in the mosquito (sexual reproduction of two genetically identical or related haplotypes). This mechanism of propagation by inbreeding is called clonal propagation, and it causes all the alleles of a haplotype to segregate together and not independently, thus increasing the genetic relatedness in the population.
The parasite effective population size (number of individuals in the population that explains the magnitude of the effect of genetic drift in the population) and its diversity, are also influenced by transmission. The increase in transmission is a consequence of the reproductive success of the different lineages of circulating parasites and the increase in their effective population size. These circumstances increase the introduction of new variants in the population (increase in the population mutation rate or \(\theta\)), the number of polymorphic sites, the richness of alleles of each polymorphic site (number of alleles per locus), and genetic diversity in terms of heterozygosity of the population (probability of selecting two different alleles by chance). On the other hand, when transmission decreases and the effective population size is reduced, and genetic drift causes the random fixation of alleles, and the diversity of the population is reduced. However, the effective size of the population and its diversity are also strongly influenced by the introduction of new variants as a result of migration and the genetic structure of the parasite population, so the relationship between these metrics and transmission is not linear.
In this section we will focus on describing the relationship between
the intensity of transmission, inferred through the number of samples
collected, and the proportion of polyclonal infections. To do this, we
will first use the function get_polygenomic to identify
polyclonal infections, determine their complexity of infection (number
of different clones coexisting in the sample), and the number of
heterozygous loci per sample. This information is automatically added to
the metadata table of the ampseq_object. We will then
proceed to describe the distribution of heterozygous loci per sample,
the COI, and the proportion of polyclonal infections in each population.
Finally, we will describe the temporal change in the proportion of
polyclonal infections and its association with the number of samples
collected.
ampseq_filtered = get_polygenomic(ampseq_object = ampseq_filtered,
strata = "Subnational_level2",
na.rm = FALSE,
filters = NULL)
The function get_polygenomic generates 3 tables:
update_popsummary = T. This
information is useful in order to identify genotyping errors that are
not filtered out in the denoising pipeline and the identification of
chimeras by dada2. That is the case of the alleles
PF3D7_1302900,1G and PF3D7_0612900,215A for
instance.ampseq_filtered@loci_performance %>%
select(loci, prop_poly_detected)%>%datatable(rownames = F)%>%
formatRound(columns='prop_poly_detected', digits=3)
This table is added to the metadata table in our ampseq_object if we
set update_popsummary = T.
ampseq_filtered@metadata %>%
select(Sample_id,NPolyLoci, Polyloci, alleles_at_loci, nalleles_per_loci, coi)%>%
datatable(rownames = F)
ampseq_filtered@pop_summary%>%
datatable(rownames = F) %>%
formatRound(columns=c('mean_coi', 'prop_poly', 'prop_poly_lower', 'prop_poly_upper'),
digits=3)
We can visually inspect all this information by using histograms. Here is an example with the number of heterozygous loci per sample by sampling location. We will use the function log_scale_histogram because the number of samples with only one heterozygous loci is high with respect to the number of samples with more than 10 heterozygous loci.
plot_log_NPolyLoci_by_pop = log_scale_histogram(data = ampseq_filtered@metadata[ampseq_filtered@metadata$NPolyLoci != 0,],
var = "NPolyLoci", binwidth = 1, group_by = "Subnational_level2",
levels = pop_levels, x_label = "Number of heterozygous loci per sample",
fill_color = pop_colors,
y_breaks = c(1, 5,10, 30), ncol = 4)+
labs(title = 'Distribution of # heterozygous loci per sample',
y = "# Samples",
x = "# heterozygous loci per sample")
plot_log_NPolyLoci_by_pop
Figure 5: Distribution of the number of heterozygous loci per samples per population.
We can do the same with the COI by sample.
plot_log_coi_by_pop = log_scale_histogram(data = ampseq_filtered@metadata,
var = "coi",
binwidth = 1,
group_by = "Subnational_level2",
levels = pop_levels,
x_label = "COI",
fill_color = pop_colors,
y_breaks = c(1,2, 5,10, 20, 50, 100,200, 400), ncol = 5)+
labs(title = 'Distribution of COI by sampling location',
y = "# Samples")
plot_log_coi_by_pop
Figure 6: Distribution of the complexity of infection (COI) by study area. y-axis is in logarithm scale.
In the case of the proportion of polyclonal infections by study area, we can use a a barplot.
plot_poly_by_pop = ampseq_filtered@pop_summary %>% ggplot(aes(x = factor(pop, levels = c(pop_levels, "Total")),
y = prop_poly,
fill = factor(pop, levels = c(pop_levels, "Total"))))+
geom_col(alpha = .6) +
geom_errorbar(aes(ymin = prop_poly_lower, ymax = prop_poly_upper), width = .2)+
theme_bw() +
labs(title = "Frequency of polyclonal infections",
y = "Frecquency") +
scale_fill_manual(values = c(pop_colors, "gray30"))+
theme(axis.text = element_text(size = 12),
axis.title = element_blank(),
legend.position = "none")
plot_poly_by_pop
Figure 7: Proportion of polyclonal infections per study area.
We can also explore temporal trends in the proportion of polyclonal
infections. First we are going to create a new variable in our metadata
concatenating the information of the study area
(Subnational_level2) and the information of the
Quarter_of_Collection.
ampseq_filtered@metadata %<>% mutate(Pop_quarter = paste(Subnational_level2, Quarter_of_Collection, sep = '_'))
Then we will use again the function get_polygenomic,
using our new variable in the slot of strata and the slot
filters equals to
'Municipality 1|Municipality 2|Municipality 3', in this way
we will just calculate the proportion of polyclonal infections for the
populations that have samples for most time points. As we don’t want to
overwrite our previous analysis, we will set
update_popsummary = F and store the output in a new
object.
poly_by_pop_over_time = get_polygenomic(ampseq_object = ampseq_filtered,
strata = "Pop_quarter",
update_popsummary = F,
na.rm = TRUE,
filters = 'Municipality 1|Municipality 2|Municipality 3')
poly_by_pop_over_time
## pop n mean_coi n_poly prop_poly prop_poly_lower
## 1 Total 1059 1.115203 118 0.11142587 0.0931027507
## 2 Municipality 1_2020-Q3 2 1.000000 0 0.00000000 0.0000000000
## 3 Municipality 1_2020-Q4 11 1.000000 0 0.00000000 0.0000000000
## 4 Municipality 1_2021-Q1 13 1.000000 0 0.00000000 0.0000000000
## 5 Municipality 1_2021-Q2 22 1.045455 1 0.04545455 0.0011501475
## 6 Municipality 1_2021-Q3 56 1.017857 1 0.01785714 0.0004520015
## 7 Municipality 1_2021-Q4 99 1.090909 9 0.09090909 0.0424164701
## 8 Municipality 1_2022-Q1 102 1.068627 7 0.06862745 0.0280352646
## 9 Municipality 1_2022-Q2 119 1.075630 8 0.06722689 0.0294685350
## 10 Municipality 1_2022-Q3 49 1.183673 9 0.18367347 0.0875903353
## 11 Municipality 2_2020-Q4 29 1.103448 3 0.10344828 0.0218637368
## 12 Municipality 2_2021-Q1 30 1.033333 1 0.03333333 0.0008435709
## 13 Municipality 2_2021-Q2 11 1.000000 0 0.00000000 0.0000000000
## 14 Municipality 2_2021-Q3 38 1.026316 1 0.02631579 0.0006660362
## 15 Municipality 2_2021-Q4 16 1.062500 1 0.06250000 0.0015811117
## 16 Municipality 2_2022-Q1 13 1.000000 0 0.00000000 0.0000000000
## 17 Municipality 2_2022-Q2 23 1.043478 1 0.04347826 0.0011001686
## 18 Municipality 2_2022-Q3 34 1.176471 5 0.14705882 0.0495284553
## 19 Municipality 3_2020-Q3 13 1.230769 3 0.23076923 0.0503810735
## 20 Municipality 3_2020-Q4 51 1.196078 10 0.19607843 0.0982434165
## 21 Municipality 3_2021-Q1 61 1.278689 17 0.27868852 0.1714720843
## 22 Municipality 3_2021-Q2 25 1.120000 3 0.12000000 0.0254653966
## 23 Municipality 3_2021-Q3 27 1.222222 5 0.18518519 0.0630000065
## 24 Municipality 3_2021-Q4 19 1.210526 4 0.21052632 0.0605245377
## 25 Municipality 3_2022-Q1 29 1.068966 1 0.03448276 0.0008726469
## 26 Municipality 3_2022-Q2 110 1.136364 15 0.13636364 0.0783762502
## 27 Municipality 3_2022-Q3 57 1.228070 13 0.22807018 0.1273971076
## prop_poly_upper
## 1 0.1319392
## 2 0.8418861
## 3 0.2849142
## 4 0.2470526
## 5 0.2284444
## 6 0.0955259
## 7 0.1655690
## 8 0.1362970
## 9 0.1281695
## 10 0.3202212
## 11 0.2735152
## 12 0.1721695
## 13 0.2849142
## 14 0.1380990
## 15 0.3023207
## 16 0.2470526
## 17 0.2194866
## 18 0.3105657
## 19 0.5381315
## 20 0.3311574
## 21 0.4082875
## 22 0.3121903
## 23 0.3808299
## 24 0.4556531
## 25 0.1776443
## 26 0.2149201
## 27 0.3583634
Now we can generate a bar plot using ggplot
plot_poly_by_pop_over_time = poly_by_pop_over_time %>%
filter(pop != 'Total')%>%
mutate(
Population = stringr::str_split(pop, '_', simplify = TRUE)[,1],
Date = stringr::str_split(pop, '_', simplify = TRUE)[,2],
prop_poly_lower = case_when(
prop_poly == 0 ~ 0,
prop_poly != 0 ~ prop_poly_lower),
prop_poly_upper = case_when(
prop_poly == 0 ~ 0,
prop_poly != 0 ~ prop_poly_upper)
)%>%
ggplot(aes(x = Date,
y = prop_poly,
ymin = prop_poly_lower,
ymax = prop_poly_upper,
fill = factor(Population, levels = pop_levels)))+
geom_col()+
geom_errorbar(width = .2)+
facet_wrap(~factor(Population, levels = pop_levels), ncol = 5)+
theme_bw()+
scale_fill_manual(values = pop_colors)+
labs(title = 'Temporal change of the proportion of polyclonal infections',
y = "Polyclonal infections",
x = "Date of Collection")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
legend.position = "none")
plot_poly_by_pop_over_time
Figure 8: Proportion of polyclonal infections per study area over year quarters.
Finally, we can test if there is correlation between the proportion of polyclonal infections and the number of collected samples per temporal unit (as an approximation of the malaria burden).
For that we will joint our table poly_by_pop_over_time
with a summary of our metadata table.
nsamples_vs_proppoly =
left_join(poly_by_pop_over_time%>%
select(pop, prop_poly), metadata %>%
mutate(Quarter_of_Collection =
paste(substr(Date_of_Collection, 1, 4), quarters.Date(Date_of_Collection), sep='-')) %>%
summarise(nsamples= n(), .by = c(Subnational_level2, Quarter_of_Collection)) %>%
mutate(pop = paste(Subnational_level2, Quarter_of_Collection, sep = '_')), by = 'pop') %>%
filter(Subnational_level2 %in% c('Municipality 1', 'Municipality 2', 'Municipality 3'))
Then we can plot the correlation using the function
ggscatter.
plot_cor_proppoly_nsamples = ggscatter(nsamples_vs_proppoly, x = "prop_poly", y = "nsamples",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "spearman",
title = '# Samples vs. Prop. Poly',
xlab = "Proportion of polyclonal samples", ylab = "Number of samples", facet.by = 'Subnational_level2')
plot_cor_proppoly_nsamples
Figure 9: Spearman correlation of the number of collected samples and proportion of polyclonal infections per year quarters. Each point represent the measurement of each year quarter in each study area.
Malaria transmission is heterogeneous throughout the Colombian territory, and it is mainly concentrated in transmission foci where demographic and environmental factors favor the interaction between the two hosts of the parasite, humans and Anopheles mosquitoes. Operationally, disease control units are defined based on sub-national administrative divisions. The connectivity or dispersal pattern of the parasite between these geographic units may limit progress in disease control if activities are not properly coordinated. This connectivity is mainly influenced by human movement, and can be inferred by estimating the proportion of genetically closely related parasites between the different areas.
We can use again the function
plot_relatedness_distribution to create histograms of the
diistribution of relatenes between samples from different study sites.
In this case, we are going to set
type_pop_comparison = 'between'.
plot_relatedness_distribution_between = plot_relatedness_distribution(
pairwise_relatedness = pairwise_relatedness,
metadata = ampseq_filtered@metadata %>% mutate(Pop = gsub('unicipality ','', Subnational_level2)),
Population = 'Pop',
fill_color = rep('gray50', 10),
type_pop_comparison = 'between',
ncol = 5,
pop_levels = c("M1-M2", "M1-M3", "M1-M4",
"M1-M5", "M2-M3", "M2-M4",
"M2-M5", "M3-M4", "M3-M5",
"M4-M5")
)
plot_relatedness_distribution_between$plot
Figure 15: Histogram distribution of pairwise genetic relatedness between sampling locations. Panels in columns correspond to the three study areas. The dotted vertical lines represent the average genetic relatedness in the whole population.
Again the fraction of highly related haplotypes between sites can be
explored with the function plot_frac_highly_related and
setting type_pop_comparison = 'between'.
plot_frac_highly_related_between = plot_frac_highly_related(
pairwise_relatedness = pairwise_relatedness,
metadata = ampseq_filtered@metadata %>% mutate(Pop = gsub('unicipality ','', Subnational_level2)),
Population = 'Pop',
fill_color = rep('gray50', 10),
threshold = 0.99, type_pop_comparison = 'between',
pop_levels = c("M1-M2", "M1-M3", "M1-M4",
"M1-M5", "M2-M3", "M2-M4",
"M2-M5", "M3-M4", "M3-M5",
"M4-M5"))
plot_frac_highly_related_between$plot
Figure 8: Proportion of highly related samples between sampling locations.